Goto

Collaborating Authors

 Eastern Desert




Code Aesthetics with Agentic Reward Feedback

Xiao, Bang, Jiang, Lingjie, Huang, Shaohan, Lv, Tengchao, Huang, Yupan, Wu, Xun, Cui, Lei, Wei, Furu

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have become valuable assistants for developers in code-related tasks. While LLMs excel at traditional programming tasks such as code generation and bug fixing, they struggle with visually-oriented coding tasks, often producing suboptimal aesthetics. In this paper, we introduce a new pipeline to enhance the aesthetic quality of LLM-generated code. We first construct AesCode-358K, a large-scale instruction-tuning dataset focused on code aesthetics. Next, we propose agentic reward feedback, a multi-agent system that evaluates executability, static aesthetics, and interactive aesthetics. Building on this, we develop GRPO-AR, which integrates these signals into the GRPO algorithm for joint optimization of functionality and code aesthetics. Finally, we develop OpenDesign, a benchmark for assessing code aesthetics. Experimental results show that combining supervised fine-tuning on AesCode-358K with reinforcement learning using agentic reward feedback significantly improves performance on OpenDesign and also enhances results on existing benchmarks such as PandasPlotBench. Notably, our AesCoder-4B surpasses GPT -4o and GPT -4.1, and achieves performance comparable to large open-source models with 480B-685B parameters, underscoring the effectiveness of our approach.Figure 1: Performance comparison of different models on the OpenDesign benchmark.


AI Adoption in NGOs: A Systematic Literature Review

Rotter, Janne, Bailkoski, William

arXiv.org Artificial Intelligence

AI has the potential to significantly improve how NGOs utilize their limited resources for societal benefits, but evidence about how NGOs adopt AI remains scattered. In this study, we systematically investigate the types of AI adoption use cases in NGOs and identify common challenges and solutions, contextualized by organizational size and geographic context. We review the existing primary literature, including studies that investigate AI adoption in NGOs related to social impact between 2020 and 2025 in English. Following the PRISMA protocol, two independent reviewers conduct study selection, with regular cross-checking to ensure methodological rigour, resulting in a final literature body of 65 studies. Leveraging a thematic and narrative approach, we identify six AI use case categories in NGOs - Engagement, Creativity, Decision-Making, Prediction, Management, and Optimization - and extract common challenges and solutions within the Technology-Organization-Environment (TOE) framework. By integrating our findings, this review provides a novel understanding of AI adoption in NGOs, linking specific use cases and challenges to organizational and environmental factors. Our results demonstrate that while AI is promising, adoption among NGOs remains uneven and biased towards larger organizations. Nevertheless, following a roadmap grounded in literature can help NGOs overcome initial barriers to AI adoption, ultimately improving effectiveness, engagement, and social impact.




AI-induced sexual harassment: Investigating Contextual Characteristics and User Reactions of Sexual Harassment by a Companion Chatbot

Mohammad, null, Namvarpour, null, Pauwels, Harrison, Razi, Afsaneh

arXiv.org Artificial Intelligence

Advancements in artificial intelligence (AI) have led to the increase of conversational agents like Replika, designed to provide social interaction and emotional support. However, reports of these AI systems engaging in inappropriate sexual behaviors with users have raised significant concerns. In this study, we conducted a thematic analysis of user reviews from the Google Play Store to investigate instances of sexual harassment by the Replika chatbot. From a dataset of 35,105 negative reviews, we identified 800 relevant cases for analysis. Our findings revealed that users frequently experience unsolicited sexual advances, persistent inappropriate behavior, and failures of the chatbot to respect user boundaries. Users expressed feelings of discomfort, violation of privacy, and disappointment, particularly when seeking a platonic or therapeutic AI companion. This study highlights the potential harms associated with AI companions and underscores the need for developers to implement effective safeguards and ethical guidelines to prevent such incidents. By shedding light on user experiences of AI-induced harassment, we contribute to the understanding of AI-related risks and emphasize the importance of corporate responsibility in developing safer and more ethical AI systems.


A UD Treebank for Bohairic Coptic

Zeldes, Amir, Speransky, Nina, Wagner, Nicholas, Schroeder, Caroline T.

arXiv.org Artificial Intelligence

Despite recent advances in digital resources for other Coptic dialects, especially Sahidic, Bohairic Coptic, the main Coptic dialect for pre-Mamluk, late Byzantine Egypt, and the contemporary language of the Coptic Church, remains critically under-resourced. This paper presents and evaluates the first syntactically annotated corpus of Bohairic Coptic, sampling data from a range of works, including Biblical text, saints' lives and Christian ascetic writing. We also explore some of the main differences we observe compared to the existing UD treebank of Sahidic Coptic, the classical dialect of the language, and conduct joint and cross-dialect parsing experiments, revealing the unique nature of Bohairic as a related, but distinct variety from the more often studied Sahidic.


JEEM: Vision-Language Understanding in Four Arabic Dialects

Kadaoui, Karima, Atwany, Hanin, Al-Ali, Hamdan, Mohamed, Abdelrahman, Mekky, Ali, Tilga, Sergei, Fedorova, Natalia, Artemova, Ekaterina, Aldarmaki, Hanan, Kementchedjhieva, Yova

arXiv.org Artificial Intelligence

We introduce JEEM, a benchmark designed to evaluate Vision-Language Models (VLMs) on visual understanding across four Arabic-speaking countries: Jordan, The Emirates, Egypt, and Morocco. JEEM includes the tasks of image captioning and visual question answering, and features culturally rich and regionally diverse content. This dataset aims to assess the ability of VLMs to generalize across dialects and accurately interpret cultural elements in visual contexts. In an evaluation of five prominent open-source Arabic VLMs and GPT-4V, we find that the Arabic VLMs consistently underperform, struggling with both visual understanding and dialect-specific generation. While GPT-4V ranks best in this comparison, the model's linguistic competence varies across dialects, and its visual understanding capabilities lag behind. This underscores the need for more inclusive models and the value of culturally-diverse evaluation paradigms.


Arabizi vs LLMs: Can the Genie Understand the Language of Aladdin?

Almaoui, Perla Al, Bouillon, Pierrette, Hengchen, Simon

arXiv.org Artificial Intelligence

In this era of rapid technological advancements, communication continues to evolve as new linguistic phenomena emerge. Among these is Arabizi, a hybrid form of Arabic that incorporates Latin characters and numbers to represent the spoken dialects of Arab communities. Arabizi is widely used on social media and allows people to communicate in an informal and dynamic way, but it poses significant challenges for machine translation due to its lack of formal structure and deeply embedded cultural nuances. This case study arises from a growing need to translate Arabizi for gisting purposes. It evaluates the capacity of different LLMs to decode and translate Arabizi, focusing on multiple Arabic dialects that have rarely been studied up until now. Using a combination of human evaluators and automatic metrics, this research project investigates the model's performance in translating Arabizi into both Modern Standard Arabic and English. Key questions explored include which dialects are translated most effectively and whether translations into English surpass those into Arabic.